Pub Date : 2024-04-29 DOI: 10.1016/j.image.2024.117130
Di Li, Mou Wang, Susanto Rahardja
Most existing tone mapping operators (TMOs) are developed based on prior assumptions about the human visual system, and they are known to be sensitive to hyperparameters. In this paper, we propose a straightforward yet efficient framework to automatically learn the priors and perform tone mapping in an end-to-end manner. The proposed algorithm utilizes a contrastive learning framework to enforce content consistency between high dynamic range (HDR) inputs and low dynamic range (LDR) outputs. Since contrastive learning aims at maximizing the mutual information across different domains, no paired images or labels are required in our algorithm. Equipped with an attention-based U-Net to alleviate aliasing and halo artifacts, our algorithm produces sharp and visually appealing images over various complex real-world scenes, indicating that it can serve as a strong baseline for future HDR image tone mapping tasks. Extensive experiments as well as subjective evaluations demonstrate that the proposed algorithm outperforms existing state-of-the-art algorithms both qualitatively and quantitatively. The code is available at https://github.com/xslidi/CATMO.
{"title":"Contrastive learning for deep tone mapping operator","authors":"Di Li , Mou Wang , Susanto Rahardja","doi":"10.1016/j.image.2024.117130","DOIUrl":"https://doi.org/10.1016/j.image.2024.117130","url":null,"abstract":"<div><p>Most existing tone mapping operators (TMOs) are developed based on prior assumptions of human visual system, and they are known to be sensitive to hyperparameters. In this paper, we proposed a straightforward yet efficient framework to automatically learn the priors and perform tone mapping in an end-to-end manner. The proposed algorithm utilizes a contrastive learning framework to enforce the content consistency between high dynamic range (HDR) inputs and low dynamic range (LDR) outputs. Since contrastive learning aims at maximizing the mutual information across different domains, no paired images or labels are required in our algorithm. Equipped with an attention-based U-Net to alleviate the aliasing and halo artifacts, our algorithm can produce sharp and visually appealing images over various complex real-world scenes, indicating that the proposed algorithm can be used as a strong baseline for future HDR image tone mapping task. Extensive experiments as well as subjective evaluations demonstrated that the proposed algorithm outperforms the existing state-of-the-art algorithms qualitatively and quantitatively. The code is available at <span>https://github.com/xslidi/CATMO</span><svg><path></path></svg>.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"126 ","pages":"Article 117130"},"PeriodicalIF":3.5,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140894882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-04-28 DOI: 10.1016/j.image.2024.117137
Junjun Wu, Jinpeng Chen, Qinghua Lu, Jiaxi Li, Ningwei Qin, Kaixuan Chen, Xilin Liu
Due to the harsh and unknown marine environment and the limited diving ability of human beings, underwater robots play an important role in ocean exploration and development. However, the performance of underwater robots is limited by blurred images, low contrast and color deviation, which result from complex underwater imaging environments. Existing mainstream object detection networks perform poorly when applied directly to underwater tasks. Although a cascaded detector network can achieve high accuracy, its inference speed is too slow for practical tasks. To address these problems, this paper proposes a lightweight and accurate one-stage underwater object detection network called U-ATSS. First, we compress the backbone of ATSS to significantly reduce the number of network parameters and improve the inference speed without losing detection accuracy, achieving a lightweight, real-time underwater object detection network. Then, we propose a plug-and-play receptive field module, F-ASPP, which obtains larger receptive fields and richer spatial information, and we optimize the learning rate strategy as well as the classification loss function to significantly improve detection accuracy and convergence speed. We evaluated and compared U-ATSS with other methods on the Kesci Underwater Object Detection Algorithm Competition dataset, which contains a variety of marine organisms. The experimental results show that U-ATSS not only has clear lightweight characteristics but also delivers excellent performance and competitiveness in terms of detection accuracy.
U-ATSS: A lightweight and accurate one-stage underwater object detection network. Signal Processing-Image Communication, Volume 126, Article 117137.
Pub Date : 2024-04-27 DOI: 10.1016/j.image.2024.117134
Debjit Das, Ruchira Naskar
Digital image forgery has become hugely widespread, as numerous easy-to-use, low-cost image manipulation tools have become widely available to the general public. Such forged images can be used with various malicious intentions, such as harming the social reputation of renowned personalities, committing identity fraud that leads to financial disasters, and many other illegitimate activities. Image splicing is a form of image forgery in which an adversary intelligently combines portions from multiple source images to generate a natural-looking artificial image. Detecting image splicing attacks poses an open challenge in the forensic domain, and recent literature describes several significant research findings on image splicing detection. However, the number of features used in such works is very large. Our aim in this work is to address the issue of feature set optimization while modeling image splicing detection as a classification problem and preserving the forgery detection efficiency reported in the state-of-the-art. This paper proposes an image splicing detection scheme based on textural features and Haralick features computed from the input image's Gray Level Co-occurrence Matrix (GLCM), and it also localizes the spliced regions in a detected spliced image. We have explored the well-known Columbia Image Splicing Detection Evaluation Dataset and the DSO-1 dataset, which is more challenging because of its post-processed color images. Experimental results show that our proposed model obtains 95% accuracy in image splicing detection with an AUC score of 0.99, using an optimized feature set of only 15 dimensions.
{"title":"Image splicing detection using low-dimensional feature vector of texture features and Haralick features based on Gray Level Co-occurrence Matrix","authors":"Debjit Das, Ruchira Naskar","doi":"10.1016/j.image.2024.117134","DOIUrl":"https://doi.org/10.1016/j.image.2024.117134","url":null,"abstract":"<div><p><em>Digital image forgery</em> has become hugely widespread, as numerous easy-to-use, low-cost image manipulation tools have become widely available to the common masses. Such forged images can be used with various malicious intentions, such as to harm the social reputation of renowned personalities, to perform identity fraud resulting in financial disasters, and many more illegitimate activities. <em>Image splicing</em> is a form of image forgery where an adversary intelligently combines portions from multiple source images to generate a natural-looking artificial image. Detection of image splicing attacks poses an open challenge in the forensic domain, and in recent literature, several significant research findings on image splicing detection have been described. However, the number of features documented in such works is significantly huge. Our aim in this work is to address the issue of feature set optimization while modeling image splicing detection as a classification problem and preserving the forgery detection efficiency reported in the state-of-the-art. This paper proposes an image-splicing detection scheme based on textural features and Haralick features computed from the input image’s Gray Level Co-occurrence Matrix (GLCM) and also localizes the spliced regions in a detected spliced image. We have explored the well-known Columbia Image Splicing Detection Evaluation Dataset and the DSO-1 dataset, which is more challenging because of its constituent post-processed color images. Experimental results prove that our proposed model obtains 95% accuracy in image splicing detection with an AUC score of 0.99, with an optimized feature set of dimensionality of 15 only.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"125 ","pages":"Article 117134"},"PeriodicalIF":3.5,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140816438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-04-24 DOI: 10.1016/j.image.2024.117132
Qianyu Wu, Zhongqian Hu, Aichun Zhu, Hui Tang, Jiaxin Zou, Yan Xi, Yang Chen
Single image super-resolution (SISR) remains an important yet challenging task. Existing methods usually ignore the diversity of the generated super-resolution (SR) images. The fine details of the corresponding high-resolution (HR) images cannot be confidently recovered due to the degradation of detail in low-resolution (LR) images. To address this issue, this paper presents a flow-based multi-scale learning network (FMLnet) to explore the diverse mapping spaces for SR. First, we propose a multi-scale learning block (MLB) to extract the underlying features of the LR image. Second, the introduced pixel-wise multi-head attention allows our model to map multiple representation subspaces simultaneously. Third, by employing a normalizing flow module for a given LR input, our approach generates various stochastic SR outputs with high visual quality, and the trade-off between fidelity and perceptual quality can be controlled. Finally, experimental results on five datasets demonstrate that the proposed network outperforms existing methods in terms of diversity and achieves competitive PSNR/SSIM results. Code is available at https://github.com/qianyuwu/FMLnet.
{"title":"A flow-based multi-scale learning network for single image stochastic super-resolution","authors":"Qianyu Wu , Zhongqian Hu , Aichun Zhu , Hui Tang , Jiaxin Zou , Yan Xi , Yang Chen","doi":"10.1016/j.image.2024.117132","DOIUrl":"10.1016/j.image.2024.117132","url":null,"abstract":"<div><p>Single image super-resolution (SISR) is still an important while challenging task. Existing methods usually ignore the diversity of generated Super-Resolution (SR) images. The fine details of the corresponding high-resolution (HR) images cannot be confidently recovered due to the degradation of detail in low-resolution (LR) images. To address the above issue, this paper presents a flow-based multi-scale learning network (FMLnet) to explore the diverse mapping spaces for SR. First, we propose a multi-scale learning block (MLB) to extract the underlying features of the LR image. Second, the introduced pixel-wise multi-head attention allows our model to map multiple representation subspaces simultaneously. Third, by employing a normalizing flow module for a given LR input, our approach generates various stochastic SR outputs with high visual quality. The trade-off between fidelity and perceptual quality can be controlled. Finally, the experimental results on five datasets demonstrate that the proposed network outperforms the existing methods in terms of diversity, and achieves competitive PSNR/SSIM results. Code is available at <span>https://github.com/qianyuwu/FMLnet</span><svg><path></path></svg>.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"125 ","pages":"Article 117132"},"PeriodicalIF":3.5,"publicationDate":"2024-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140760491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-04-21 DOI: 10.1016/j.image.2024.117131
Dongyang Li, Rencan Nie, Jinde Cao, Gucheng Zhang, Biaojian Jin
Existing methods for infrared and visible image fusion (IVIF) often overlook the analysis of common and distinct features among source images. Consequently, this study develops S2CANet, a self-supervised infrared and visible image fusion method based on a co-attention network, incorporating auxiliary networks and backbone networks in its design. The primary concept is to transform both common and distinct features into common features and reconstructed features, subsequently deriving the distinct features through their subtraction. To enhance the similarity of common features, we design a fusion block based on co-attention (FBC) module specifically for this purpose, capturing common features through co-attention. Moreover, fine-tuning the auxiliary network enhances the image reconstruction effectiveness of the backbone network. Notably, the auxiliary network is employed only during training to guide the self-supervised completion of IVIF by the backbone network. Additionally, we introduce a novel estimate for the weighted fidelity loss to guide the fused image in preserving more brightness from the source images. Experiments conducted on diverse benchmark datasets demonstrate the superior performance of our S2CANet over state-of-the-art IVIF methods.
{"title":"S2CANet: A self-supervised infrared and visible image fusion based on co-attention network","authors":"Dongyang Li , Rencan Nie , Jinde Cao , Gucheng Zhang , Biaojian Jin","doi":"10.1016/j.image.2024.117131","DOIUrl":"10.1016/j.image.2024.117131","url":null,"abstract":"<div><p>Existing methods for infrared and visible image fusion (IVIF) often overlook the analysis of common and distinct features among source images. Consequently, this study develops A self-supervised infrared and visible image fusion based on co-attention network, incorporating auxiliary networks and backbone networks in its design. The primary concept is to transform both common and distinct features into common features and reconstructed features, subsequently deriving the distinct features through their subtraction. To enhance the similarity of common features, we designed the fusion block based on co-attention (FBC) module specifically for this purpose, capturing common features through co-attention. Moreover, fine-tuning the auxiliary network enhances the image reconstruction effectiveness of the backbone network. It is noteworthy that the auxiliary network is exclusively employed during training to guide the self-supervised completion of IVIF by the backbone network. Additionally, we introduce a novel estimate for weighted fidelity loss to guide the fused image in preserving more brightness from the source image. Experiments conducted on diverse benchmark datasets demonstrate the superior performance of our S2CANet over state-of-the-art IVIF methods.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"125 ","pages":"Article 117131"},"PeriodicalIF":3.5,"publicationDate":"2024-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140796369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-04-12 DOI: 10.1016/j.image.2024.117129
Feng Yang, Yichao Cao, Qifan Xue, Shuai Jin, Xuanpeng Li, Weigong Zhang
Distinguishable deep features are essential for 3D point cloud recognition, as they influence the search for the optimal classifier. Most existing point cloud classification methods mainly focus on local information aggregation while ignoring the feature distribution of the whole dataset, which encodes more informative and intrinsic semantic relationships among labeled data and, if better exploited, could yield more distinguishable inter-class features. Our work attempts to construct a more distinguishable feature space by performing feature distribution refinement inspired by contrastive learning and sample mining strategies, without modifying the model architecture. To explore the full potential of feature distribution refinement, two modules are introduced to boost the distinguishability of exceptionally distributed samples in an adaptive manner: (i) a Confusion-Prone Classes Mining (CPCM) module aimed at hard-to-distinguish classes, which alleviates the massive category-level confusion by generating class-level soft labels; and (ii) an Entropy-Aware Attention (EAA) mechanism proposed to remove the influence of trivial cases that could substantially weaken model performance. Our method achieves competitive results on multiple point cloud applications. In particular, it reaches 85.8% accuracy on ScanObjectNN, with substantial performance gains of up to 2.7% on DCGNN, 3.1% on PointNet++, and 2.4% on GBNet. Our code is available at https://github.com/YangFengSEU/CEDR.
{"title":"CEDR: Contrastive Embedding Distribution Refinement for 3D point cloud representation","authors":"Feng Yang , Yichao Cao , Qifan Xue , Shuai Jin , Xuanpeng Li , Weigong Zhang","doi":"10.1016/j.image.2024.117129","DOIUrl":"10.1016/j.image.2024.117129","url":null,"abstract":"<div><p>The distinguishable deep features are essential for the 3D point cloud recognition as they influence the search for the optimal classifier. Most existing point cloud classification methods mainly focus on local information aggregation while ignoring the feature distribution of the whole dataset that indicates more informative and intrinsic semantic relationships of labeled data, if better exploited, which could learn more distinguishing inter-class features. Our work attempts to construct a more distinguishable feature space through performing feature distribution refinement inspired by contrastive learning and sample mining strategies, without modifying the model architecture. To explore the full potential of feature distribution refinement, two modules are involved to boost exceptionally distributed samples distinguishability in an adaptive manner: (i) Confusion-Prone Classes Mining (CPCM) module is aimed at hard-to-distinct classes, which alleviates the massive category-level confusion by generating class-level soft labels; (ii) Entropy-Aware Attention (EAA) mechanism is proposed to remove influence of the trivial cases which could substantially weaken model performance. Our method achieves competitive results on multiple applications of point cloud. In particular, our method gets 85.8% accuracy on ScanObjectNN, and substantial performance gains up to 2.7% in DCGNN, 3.1% in PointNet++, and 2.4% in GBNet. Our code is available at <span>https://github.com/YangFengSEU/CEDR</span><svg><path></path></svg>.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"125 ","pages":"Article 117129"},"PeriodicalIF":3.5,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140767552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-04-12 DOI: 10.1016/j.image.2024.117128
Daehyeon Kong, Kyeongbo Kong, Suk-Ju Kang
In recent years, deep neural networks pretrained on large-scale datasets have been used to address data deficiency and achieve better performance through prior knowledge. Contrastive language-image pretraining (CLIP), a vision-language model pretrained on an extensive dataset, achieves strong performance in image recognition. In this study, we harness the power of multimodality in image clustering tasks, shifting from a single modality to a multimodal framework by using the describability property of the image encoder of the CLIP model. The importance of this shift lies in the ability of multimodality to provide richer feature representations. By generating text centroids corresponding to image features, we effectively create a common descriptive language for each cluster. The text centroids are learned using the results of a standard clustering algorithm as pseudo-labels, producing a common description of each cluster, and assigning the image features to these centroids improves the clustering performance. Although only text centroids are added when the image features in the same space are assigned to them, the clustering performance improves significantly compared to the standard clustering algorithm, especially on complex datasets. When the proposed method is applied, the normalized mutual information score rises by 32% on the Stanford40 dataset and 64% on ImageNet-Dog compared to the k-means clustering algorithm.
{"title":"Image clustering using generated text centroids","authors":"Daehyeon Kong , Kyeongbo Kong , Suk-Ju Kang","doi":"10.1016/j.image.2024.117128","DOIUrl":"https://doi.org/10.1016/j.image.2024.117128","url":null,"abstract":"<div><p>In recent years, deep neural networks pretrained on large-scale datasets have been used to address data deficiency and achieve better performance through prior knowledge. Contrastive language–image pretraining (CLIP), a vision-language model pretrained on an extensive dataset, achieves better performance in image recognition. In this study, we harness the power of multimodality in image clustering tasks, shifting from a single modality to a multimodal framework using the describability property of image encoder of the CLIP model. The importance of this shift lies in the ability of multimodality to provide richer feature representations. By generating text centroids corresponding to image features, we effectively create a common descriptive language for each cluster. It generates text centroids assigned by the image features and improves the clustering performance. The text centroids use the results generated by using the standard clustering algorithm as a pseudo-label and learn a common description of each cluster. Finally, only text centroids were added when the image features on the same space were assigned to the text centroids, but the clustering performance improved significantly compared to the standard clustering algorithm, especially on complex datasets. When the proposed method is applied, the normalized mutual information score rises by 32% on the Stanford40 dataset and 64% on ImageNet-Dog compared to the <span><math><mi>k</mi></math></span>-means clustering algorithm.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"125 ","pages":"Article 117128"},"PeriodicalIF":3.5,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140555139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-30 DOI: 10.1016/j.image.2024.117120
Yanjun Liu, Wenming Yang
Indoor 3D object detection is an essential task in single-image scene understanding, fundamentally impacting spatial cognition in visual reasoning. Existing works on 3D object detection from a single image either pursue this goal through independent predictions of each object or implicitly reason over all possible objects, failing to harness relational geometric information between objects. To address this problem, we propose a sparse graph-based pipeline named Explicit3D based on object geometry and semantic features. Taking efficiency into consideration, we further define a relatedness score and design a novel dynamic pruning method via group sampling for sparse scene graph generation and updating. Furthermore, our Explicit3D introduces homogeneous matrices and defines new relative and corner losses to model the spatial difference between target pairs explicitly. Instead of using ground-truth labels as direct supervision, our relative and corner losses are derived from homogeneous transforms, which enables the model to learn the geometric consistency between objects. The experimental results on the SUN RGB-D dataset demonstrate that our Explicit3D achieves a better performance balance than the state-of-the-art.
{"title":"Explicit3D: Graph network with spatial inference for single image 3D object detection","authors":"Yanjun Liu, Wenming Yang","doi":"10.1016/j.image.2024.117120","DOIUrl":"https://doi.org/10.1016/j.image.2024.117120","url":null,"abstract":"<div><p>Indoor 3D object detection is an essential task in single image scene understanding, impacting spatial cognition fundamentally in visual reasoning. Existing works on 3D object detection from a single image either pursue this goal through independent predictions of each object or implicitly reason over all possible objects, failing to harness relational geometric information between objects. To address this problem, we propose a sparse graph-based pipeline named Explicit3D based on object geometry and semantics features. Taking the efficiency into consideration, we further define a relatedness score and design a novel dynamic pruning method via group sampling for sparse scene graph generation and updating. Furthermore, our Explicit3D introduces homogeneous matrices and defines new relative loss and corner loss to model the spatial difference between target pairs explicitly. Instead of using ground-truth labels as direct supervision, our relative and corner loss are derived from homogeneous transforms, which renders the model to learn the geometric consistency between objects. The experimental results on the SUN RGB-D dataset demonstrate that our Explicit3D achieves better performance balance than the-state-of-the-art.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"124 ","pages":"Article 117120"},"PeriodicalIF":3.5,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140533524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-20 DOI: 10.1016/j.image.2024.117117
Israr Hussain, Shunquan Tan, Jiwu Huang
Cropping an image is a common image editing technique that aims to find viewpoints with suitable image composition. It is also a frequently used post-processing technique to reduce the evidence of tampering in an image. Detecting cropped images poses a significant challenge in the field of digital image forensics, as the distortions introduced by image cropping are often imperceptible to the human eye. Deep neural networks achieve state-of-the-art performance due to their ability to encode large-scale data and handle billions of model parameters; however, their high computational complexity and substantial storage requirements make it difficult to deploy such large models on resource-constrained devices such as mobile phones and embedded systems. To address this issue, we propose a lightweight deep learning framework for cropping detection in the spatial domain, based on knowledge distillation. Initially, we constructed four datasets containing a total of 60,000 images cropped using various tools. We then used EfficientNet-B0, pre-trained on ImageNet with significant surgical adjustments, as the teacher model, which makes it more robust and faster to converge on this downstream task. The model was trained on 20,000 cropped and uncropped images from our own dataset, and its knowledge was then transferred to a more compact student model. Finally, we selected the best-performing lightweight model as the final prediction model, with a testing accuracy of 98.44% on the test dataset, which outperforms other methods. Extensive experiments demonstrate that our proposed model, distilled from EfficientNet-B0, achieves state-of-the-art performance in terms of detection accuracy, training parameters, and FLOPs, outperforming existing methods in detecting cropped images.
{"title":"A knowledge distillation based deep learning framework for cropped images detection in spatial domain","authors":"Israr Hussain , Shunquan Tan , Jiwu Huang","doi":"10.1016/j.image.2024.117117","DOIUrl":"10.1016/j.image.2024.117117","url":null,"abstract":"<div><p>Cropping an image is a common image editing technique that aims to find viewpoints with suitable image composition. It is also a frequently used post-processing technique to reduce the evidence of tampering in an image. Detecting cropped images poses a significant challenge in the field of digital image forensics, as the distortions introduced by image cropping are often imperceptible to the human eye. Although deep neural networks achieve state-of-the-art performance, due to their ability to encode large-scale data and handle billions of model parameters. However, due to their high computational complexity and substantial storage requirements, it is difficult to deploy these large deep learning models on resource-constrained devices such as mobile phones and embedded systems. To address this issue, we propose a lightweight deep learning framework for cropping detection in spatial domain, based on knowledge distillation. Initially, we constructed four datasets containing a total of 60,000 images cropped using various tools. We then used Efficient-Net-B0, pre-trained on ImageNet with significant surgical adjustments, as the teacher model, which makes it more robust and faster to converge in this downstream task. The model was trained on 20,000 cropped and uncropped images from our own dataset, and we then applied its knowledge to a more compact model called the student model. Finally, we selected the best-performing lightweight model as the final prediction model, with a testing accuracy of 98.44% on the test dataset, which outperforms other methods. Extensive experiments demonstrate that our proposed model, distilled from Efficient-Net-B0, achieves state-of-the-art performance in terms of detection accuracy, training parameters, and FLOPs, outperforming existing methods in detecting cropped images.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"124 ","pages":"Article 117117"},"PeriodicalIF":3.5,"publicationDate":"2024-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140271577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-15 DOI: 10.1016/j.image.2024.117116
Che-Wei Lee
For immediate access and online backup, cloud storage has become a mainstream way to store and distribute digital data. Ensuring that images accessed or downloaded from the cloud are reliable is critical for storage service providers. In this study, a new distortion-free color image authentication method with tampering recovery capability is proposed, based on secret sharing, data compression and image interpolation. The proposed method generates elaborate authentication signals that serve the double function of tampering localization and image repairing. The authentication signals are subsequently converted into many shares using a (k, n)-threshold method, so as to increase the multiplicity of the authentication signals and reinforce the tampering recovery capability. These shares are then randomly concealed in the alpha channel of the to-be-protected image, which has been transformed into the PNG format containing RGBA channels. In authentication, the authentication signals computed from the alpha channel are not only used to indicate whether an image block has been tampered with, but also used as a signal to find the corresponding color in a predefined palette to recover the tampered image block. Compared with several state-of-the-art methods, the proposed method attains desirable properties including losslessness, tampering localization and tampering recovery. Experimental results and discussions on security considerations and comparisons with other related methods are provided to demonstrate the superiority of the proposed method.
{"title":"A distortion-free authentication method for color images with tampering localization and self-recovery","authors":"Che-Wei Lee","doi":"10.1016/j.image.2024.117116","DOIUrl":"10.1016/j.image.2024.117116","url":null,"abstract":"<div><p>For immediate access and online backup, cloud storage has become a mainstream way for digital data storage and distribution. To assure images accessed or downloaded from clouds are reliable is critical to storage service providers. In this study, a new distortion-free color image authentication method based on secret sharing, data compression and image interpolation with the tampering recovery capability is proposed. The proposed method generates elaborate authentication signals which have double functions of tampering localization and image repairing. The authentication signals are subsequently converted into many shares with the use of (<em>k, n</em>)-threshold method so as to increase the multiplicity of authentication signals for reinforcing the capability of tampering recovery. These shares are then randomly concealed in the alpha channel of the to-be-protected image that has been transformed into the PNG format containing RGBA channels. In authentication, the authentication signals computed from the alpha channel are not only used for indicating if an image block has been tampered with or not, but used as a signal to find the corresponding color in a predefined palette to recover the tampered image block. Compared with several state-of-the-art methods, the proposed method attains positive properties including losslessness, tampering localization and tampering recovery. Experimental results and discussions on security consideration and comparison with other related methods are provided to demonstrate the outperformance of the proposed method.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"124 ","pages":"Article 117116"},"PeriodicalIF":3.5,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140204243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}