Pub Date: 2024-04-21, DOI: 10.1016/j.image.2024.117131
Dongyang Li, Rencan Nie, Jinde Cao, Gucheng Zhang, Biaojian Jin
Existing methods for infrared and visible image fusion (IVIF) often overlook the analysis of common and distinct features among source images. Consequently, this study develops S2CANet, a self-supervised infrared and visible image fusion method based on a co-attention network, which incorporates auxiliary and backbone networks in its design. The primary concept is to transform both common and distinct features into common features and reconstructed features, subsequently deriving the distinct features through their subtraction. To enhance the similarity of common features, we design a fusion block based on co-attention (FBC) module specifically for this purpose, capturing common features through co-attention. Moreover, fine-tuning the auxiliary network enhances the image reconstruction effectiveness of the backbone network. Notably, the auxiliary network is employed only during training, where it guides the backbone network to complete IVIF in a self-supervised manner. Additionally, we introduce a novel estimate for the weighted fidelity loss to guide the fused image in preserving more brightness from the source images. Experiments conducted on diverse benchmark datasets demonstrate the superior performance of our S2CANet over state-of-the-art IVIF methods.
S2CANet: A self-supervised infrared and visible image fusion based on co-attention network. Signal Processing: Image Communication, vol. 125, Article 117131.
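As an illustration only, the following is a minimal sketch of a co-attention-style fusion block between infrared and visible feature maps: each modality attends to the other through a shared affinity matrix and the attended features are averaged into a "common" feature map. The module name, channel sizes and softmax-normalized cross-affinity are assumptions for illustration, not the paper's FBC module.

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Toy co-attention fusion: each modality attends to the other, and the
    attended features are averaged into a 'common' feature map."""
    def __init__(self, channels: int):
        super().__init__()
        self.query_ir = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.key_vis = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.value_vis = nn.Conv2d(channels, channels, kernel_size=1)
        self.value_ir = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat_ir, feat_vis):
        b, c, h, w = feat_ir.shape
        q = self.query_ir(feat_ir).flatten(2).transpose(1, 2)        # (B, HW, C/2)
        k = self.key_vis(feat_vis).flatten(2)                        # (B, C/2, HW)
        affinity = torch.softmax(q @ k / (k.shape[1] ** 0.5), dim=-1)  # (B, HW, HW)
        v_vis = self.value_vis(feat_vis).flatten(2).transpose(1, 2)  # (B, HW, C)
        v_ir = self.value_ir(feat_ir).flatten(2).transpose(1, 2)
        # IR positions gather visible context, and vice versa (transposed affinity).
        common_from_vis = (affinity @ v_vis).transpose(1, 2).reshape(b, c, h, w)
        common_from_ir = (affinity.transpose(1, 2) @ v_ir).transpose(1, 2).reshape(b, c, h, w)
        return 0.5 * (common_from_vis + common_from_ir)

# usage with random feature maps
fusion = CoAttentionFusion(channels=64)
common = fusion(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```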
Pub Date: 2024-04-12, DOI: 10.1016/j.image.2024.117129
Feng Yang, Yichao Cao, Qifan Xue, Shuai Jin, Xuanpeng Li, Weigong Zhang
Distinguishable deep features are essential for 3D point cloud recognition, as they influence the search for the optimal classifier. Most existing point cloud classification methods focus on local information aggregation while ignoring the feature distribution of the whole dataset, which encodes more informative and intrinsic semantic relationships among labeled data and, if better exploited, could yield more discriminative inter-class features. Our work attempts to construct a more distinguishable feature space by performing feature distribution refinement inspired by contrastive learning and sample mining strategies, without modifying the model architecture. To explore the full potential of feature distribution refinement, two modules are introduced to boost the distinguishability of exceptionally distributed samples in an adaptive manner: (i) the Confusion-Prone Classes Mining (CPCM) module targets hard-to-distinguish classes and alleviates massive category-level confusion by generating class-level soft labels; (ii) the Entropy-Aware Attention (EAA) mechanism removes the influence of trivial cases that could substantially weaken model performance. Our method achieves competitive results on multiple point cloud applications. In particular, it reaches 85.8% accuracy on ScanObjectNN, with substantial performance gains of up to 2.7% in DCGNN, 3.1% in PointNet++, and 2.4% in GBNet. Our code is available at https://github.com/YangFengSEU/CEDR.
CEDR: Contrastive Embedding Distribution Refinement for 3D point cloud representation. Signal Processing: Image Communication, vol. 125, Article 117129.
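For illustration only, here is a minimal sketch of entropy-aware sample weighting: each sample's cross-entropy is re-weighted by its normalized prediction entropy so that trivially confident samples contribute less. This specific weighting form, the function name and its hyperparameters are assumptions and not the published EAA mechanism.

```python
import torch
import torch.nn.functional as F

def entropy_weighted_ce(logits, labels, eps=1e-8):
    """Cross-entropy where each sample is re-weighted by its normalized
    prediction entropy, so trivially confident samples contribute less.
    This weighting form is an assumption for illustration."""
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * (probs + eps).log()).sum(dim=1)            # (N,)
    max_entropy = torch.log(torch.tensor(float(logits.shape[1])))  # log(num_classes)
    weights = entropy / max_entropy                                # in [0, 1]
    ce = F.cross_entropy(logits, labels, reduction="none")         # (N,)
    return (weights * ce).mean()

# usage with random logits for a 15-class problem
loss = entropy_weighted_ce(torch.randn(8, 15), torch.randint(0, 15, (8,)))
```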
Pub Date: 2024-04-12, DOI: 10.1016/j.image.2024.117128
Daehyeon Kong, Kyeongbo Kong, Suk-Ju Kang
In recent years, deep neural networks pretrained on large-scale datasets have been used to address data deficiency and achieve better performance through prior knowledge. Contrastive language–image pretraining (CLIP), a vision–language model pretrained on an extensive dataset, achieves better performance in image recognition. In this study, we harness the power of multimodality in image clustering tasks, shifting from a single modality to a multimodal framework by exploiting the describability property of the CLIP image encoder. The importance of this shift lies in the ability of multimodality to provide richer feature representations. By generating text centroids corresponding to image features, we effectively create a common descriptive language for each cluster, which improves clustering performance. The text centroids are learned using the results of a standard clustering algorithm as pseudo-labels, so that each centroid captures a common description of its cluster. Finally, although the only addition is assigning image features in the same embedding space to these text centroids, clustering performance improves significantly compared to the standard clustering algorithm, especially on complex datasets. When the proposed method is applied, the normalized mutual information score rises by 32% on the Stanford40 dataset and 64% on ImageNet-Dog compared to the k-means clustering algorithm.
Image clustering using generated text centroids. Signal Processing: Image Communication, vol. 125, Article 117128.
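A minimal sketch of the assignment step follows, assuming precomputed, L2-normalizable CLIP image embeddings and learned text centroid embeddings that live in the same space; the centroid learning itself (driven by pseudo-labels from a standard clustering run) is omitted, and the function name is illustrative.

```python
import numpy as np

def assign_to_text_centroids(image_feats, text_centroids):
    """Assign each image embedding to the nearest text centroid by cosine
    similarity. Both inputs are assumed to lie in the shared CLIP space."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_centroids / np.linalg.norm(text_centroids, axis=1, keepdims=True)
    similarity = img @ txt.T               # (num_images, num_clusters)
    return similarity.argmax(axis=1)       # cluster index per image

# usage with random stand-ins for 100 images and 10 clusters in a 512-d space
labels = assign_to_text_centroids(np.random.randn(100, 512), np.random.randn(10, 512))
```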
Pub Date: 2024-03-30, DOI: 10.1016/j.image.2024.117120
Yanjun Liu, Wenming Yang
Indoor 3D object detection is an essential task in single-image scene understanding, with a fundamental impact on spatial cognition in visual reasoning. Existing works on 3D object detection from a single image either pursue this goal through independent predictions for each object or implicitly reason over all possible objects, failing to harness the relational geometric information between objects. To address this problem, we propose a sparse graph-based pipeline named Explicit3D based on object geometry and semantic features. Taking efficiency into consideration, we further define a relatedness score and design a novel dynamic pruning method via group sampling for sparse scene graph generation and updating. Furthermore, our Explicit3D introduces homogeneous matrices and defines new relative and corner losses to model the spatial difference between target pairs explicitly. Instead of using ground-truth labels as direct supervision, our relative and corner losses are derived from homogeneous transforms, which leads the model to learn the geometric consistency between objects. Experimental results on the SUN RGB-D dataset demonstrate that our Explicit3D achieves a better performance balance than the state-of-the-art.
Explicit3D: Graph network with spatial inference for single image 3D object detection. Signal Processing: Image Communication, vol. 124, Article 117120.
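As a hedged illustration of supervising relative geometry with homogeneous transforms, the sketch below computes the relative 4x4 transform between a pair of predicted object poses and compares it to the ground-truth relative transform with a Frobenius-norm discrepancy; this simplified loss form, and the helper names, are assumptions rather than the paper's exact relative and corner losses.

```python
import numpy as np

def make_homogeneous(rotation, translation):
    """Build a 4x4 homogeneous matrix from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def relative_transform(T_a, T_b):
    """Pose of object b expressed in the frame of object a."""
    return np.linalg.inv(T_a) @ T_b

def relative_loss(pred_a, pred_b, gt_a, gt_b):
    """Frobenius-norm discrepancy between predicted and ground-truth
    relative transforms (an assumed, simplified loss form)."""
    return np.linalg.norm(relative_transform(pred_a, pred_b)
                          - relative_transform(gt_a, gt_b))

# usage: two objects whose predicted relative offset slightly overshoots the ground truth
I = np.eye(3)
loss = relative_loss(make_homogeneous(I, [0, 0, 0]), make_homogeneous(I, [1.2, 0, 0]),
                     make_homogeneous(I, [0, 0, 0]), make_homogeneous(I, [1.0, 0, 0]))
```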
Pub Date: 2024-03-20, DOI: 10.1016/j.image.2024.117117
Israr Hussain, Shunquan Tan, Jiwu Huang
Cropping an image is a common image editing technique that aims to find viewpoints with suitable image composition. It is also a frequently used post-processing technique to reduce the evidence of tampering in an image. Detecting cropped images poses a significant challenge in the field of digital image forensics, as the distortions introduced by image cropping are often imperceptible to the human eye. Deep neural networks achieve state-of-the-art performance thanks to their ability to encode large-scale data and handle billions of model parameters. However, due to their high computational complexity and substantial storage requirements, it is difficult to deploy these large deep learning models on resource-constrained devices such as mobile phones and embedded systems. To address this issue, we propose a lightweight deep learning framework for cropping detection in the spatial domain, based on knowledge distillation. Initially, we constructed four datasets containing a total of 60,000 images cropped using various tools. We then used EfficientNet-B0, pre-trained on ImageNet with significant surgical adjustments, as the teacher model, which makes it more robust and faster to converge on this downstream task. The model was trained on 20,000 cropped and uncropped images from our own dataset, and its knowledge was then transferred to a more compact student model. Finally, we selected the best-performing lightweight model as the final prediction model; it reaches a testing accuracy of 98.44% on the test dataset, outperforming other methods. Extensive experiments demonstrate that our proposed model, distilled from EfficientNet-B0, achieves state-of-the-art performance in terms of detection accuracy, training parameters, and FLOPs, outperforming existing methods in detecting cropped images.
A knowledge distillation based deep learning framework for cropped images detection in spatial domain. Signal Processing: Image Communication, vol. 124, Article 117117.
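To make the teacher-student transfer concrete, here is a minimal sketch of a standard temperature-scaled distillation objective that combines soft teacher targets with hard labels; the temperature and mixing weight are assumptions, and the paper's exact training recipe may differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """KL divergence to softened teacher outputs plus cross-entropy on hard labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)   # rescale gradients by T^2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# usage with random logits for a binary cropped/uncropped decision
loss = distillation_loss(torch.randn(4, 2), torch.randn(4, 2), torch.randint(0, 2, (4,)))
```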
Pub Date: 2024-03-15, DOI: 10.1016/j.image.2024.117116
Che-Wei Lee
For immediate access and online backup, cloud storage has become a mainstream way to store and distribute digital data. Assuring that images accessed or downloaded from clouds are reliable is critical to storage service providers. In this study, a new distortion-free color image authentication method with tampering-recovery capability is proposed, based on secret sharing, data compression and image interpolation. The proposed method generates elaborate authentication signals that serve the double function of tampering localization and image repair. The authentication signals are subsequently converted into many shares using a (k, n)-threshold method, so as to increase the multiplicity of authentication signals and reinforce the tampering-recovery capability. These shares are then randomly concealed in the alpha channel of the to-be-protected image, which has been transformed into the PNG format containing RGBA channels. In authentication, the authentication signals computed from the alpha channel are not only used to indicate whether an image block has been tampered with, but also serve as a signal to find the corresponding color in a predefined palette to recover the tampered image block. Compared with several state-of-the-art methods, the proposed method attains positive properties including losslessness, tampering localization and tampering recovery. Experimental results and discussions on security considerations and comparisons with other related methods are provided to demonstrate the superior performance of the proposed method.
A distortion-free authentication method for color images with tampering localization and self-recovery. Signal Processing: Image Communication, vol. 124, Article 117116.
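A minimal (k, n)-threshold sketch in the spirit of Shamir's secret sharing over a small prime field is shown below, only to illustrate how an authentication signal can be split into n shares recoverable from any k of them; the prime, the per-value encoding and the share placement are assumptions and differ from the paper's construction.

```python
import random

PRIME = 257  # small prime field, enough for one byte of secret per polynomial

def make_shares(secret, k, n):
    """Split an integer secret into n shares; any k of them reconstruct it."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    def poly(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

# usage: split one authentication byte into 5 shares, recover from any 3
shares = make_shares(123, k=3, n=5)
assert reconstruct(shares[:3]) == 123
```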
Pub Date: 2024-02-15, DOI: 10.1016/j.image.2024.117109
Li Xu, Yaodong Zhou, Bing Luo, Bo Li, Chao Zhang
Object cosegmentation aims to obtain common objects from multiple images or videos, and is typically performed either by employing handcrafted features to evaluate region similarity or by learning higher-level semantic information via deep learning. However, the former, based on handcrafted features, is sensitive to illumination, appearance changes and cluttered backgrounds across the domain gap. The latter, based on deep learning, needs ground-truth object segmentation to train a co-attention model that spotlights the common object regions in different domains. This paper proposes an adversarial domain adaptation-based video object cosegmentation method without any pixel-wise supervision. Intuitively, high-level semantic similarity is beneficial for common object recognition. However, the feature distributions of different video sources are inconsistent, i.e., there is a domain gap. We propose an adversarial learning method to align the feature distributions of different videos, which aims to maintain the feature similarity of common objects and overcome the dataset bias. Hence, a feature encoder based on a Siamese network is constructed to fool a discriminative network and obtain a domain-adapted feature mapping. To further assist the feature embedding of common objects, we define a latent task for label generation to train a classification network, which makes full use of high-level semantic information. Experimental results on several video cosegmentation datasets suggest that domain adaptation based on adversarial learning can significantly improve common semantic feature extraction.
Adversarial domain adaptation with Siamese network for video object cosegmentation. Signal Processing: Image Communication, vol. 123, Article 117109.
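For illustration, here is a minimal sketch of adversarial feature alignment using a gradient reversal layer in front of a domain discriminator, a common way to make an encoder fool a discriminative network; the Siamese encoder, the label-generation task and the training schedule from the paper are not reproduced, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# The discriminator tries to tell which video a feature came from; the encoder,
# trained through the reversed gradient, is pushed toward video-invariant features.
discriminator = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))
features = torch.randn(8, 256, requires_grad=True)   # stand-in for encoder output
domain_labels = torch.randint(0, 2, (8,))             # which source video
logits = discriminator(grad_reverse(features, lambd=0.5))
adv_loss = nn.functional.cross_entropy(logits, domain_labels)
adv_loss.backward()
```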
Pub Date: 2024-01-22, DOI: 10.1016/j.image.2023.117087
Joan Bartrina-Rapesta, Miguel Hernández-Cabronero, Victor Sanchez, Joan Serra-Sagristà, Pouya Jamshidi, J. Castellani
Online collaborative tools for medical diagnosis from digital pathology images have experienced an increase in demand in recent years. Due to the large sizes of pathology images, rate control (RC) techniques that allow accurate control of compressed file sizes are critical to meet existing bandwidth restrictions while maximizing retrieved image quality. Recently, some RC contributions to Region of Interest (RoI) coding for pathology imaging have been presented. These encode the RoI without loss and the background with some loss, and focus on providing high RC accuracy for the background area. However, none of these RC contributions deals efficiently with arbitrary RoI shapes, which hinders the accuracy of background definition and rate control. This manuscript presents a novel prediction-based coding system with a novel RC algorithm for RoI coding that allows arbitrary RoI shapes. Compared to other state-of-the-art methods, our proposed algorithm significantly improves upon their RC accuracy while reducing the compressed data rate for the RoI by 30%. Furthermore, it offers higher quality in the reconstructed background areas, which has been linked to better clinical performance by expert pathologists. Finally, the proposed method also allows lossless compression of both the RoI and the background, producing data volumes 14% lower than coding techniques included in DICOM, such as HEVC and JPEG-LS.
Prediction-based coding with rate control for lossless region of interest in pathology imaging. Signal Processing: Image Communication, vol. 123, Article 117087.
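To illustrate the rate-control idea generically, the sketch below searches for a quantization step that brings the compressed background size close to a byte budget via bisection; `compress_background` is a hypothetical stand-in for the actual prediction-based coder, and this search is not the algorithm from the paper.

```python
def rate_control_bisection(compress_background, target_bytes,
                           q_min=1.0, q_max=128.0, iters=20):
    """Find a quantization step whose compressed size is close to target_bytes.
    Assumes size decreases monotonically as the quantization step grows."""
    for _ in range(iters):
        q = 0.5 * (q_min + q_max)
        size = compress_background(q)
        if size > target_bytes:
            q_min = q          # over budget: quantize more coarsely
        else:
            q_max = q          # under budget: we can afford finer quantization
    return 0.5 * (q_min + q_max)

# usage with a toy stand-in whose size shrinks as the step grows
q = rate_control_bisection(lambda step: int(1_000_000 / step), target_bytes=50_000)
```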
Pub Date: 2024-01-15, DOI: 10.1016/j.image.2024.117102
Nikolaos Detsikas, Nikolaos Mitianoudis, Nikolaos Papamarkos
The task of binarization of historical document images has been at the forefront of image processing research during the digital transition of libraries. The process of storing and transcribing valuable historical printed or handwritten material can salvage world cultural heritage and make it available online without physical attendance. The task of binarization can be viewed as a pre-processing step that attempts to separate the printed or handwritten characters in the image from noise and stains, which assists the Optical Character Recognition (OCR) process. Many approaches have been proposed before, including deep learning based approaches. In this article, we propose a U-Net style deep learning architecture that incorporates several other developments of deep learning, including residual connections, multi-resolution connections, visual attention blocks and dilated convolution blocks for upsampling. The novelties of the proposed DMVAnet lie in the combined use of these elements in a novel U-Net style architecture and in the application of DMVAnet to image binarization for the first time. In addition, the proposed DMVAnet is a computationally lightweight network that performs very close to or even better than state-of-the-art approaches with a fraction of the network size and parameters. Finally, it can be used on platforms with restricted processing power and system resources, such as mobile devices, and through scaling can achieve inference times that allow real-time applications.
A Dilated MultiRes Visual Attention U-Net for historical document image binarization. Signal Processing: Image Communication, vol. 122, Article 117102.
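Below is a minimal sketch of a dilated convolution block of the kind a DMVAnet-style network might use to aggregate multi-scale context; the channel counts, dilation rates and residual connection are illustrative assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Parallel 3x3 convolutions with increasing dilation, fused by a 1x1 conv
    and added back to the input as a residual."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.act(x + self.fuse(multi_scale))

# usage on a random 32-channel feature map
out = DilatedBlock(32)(torch.randn(1, 32, 64, 64))
```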
Pub Date: 2024-01-08, DOI: 10.1016/j.image.2024.117100
Pratibha Kumari, Priyankar Choudhary, Vinit Kujur, Pradeep K. Atrey, Mukesh Saini
Anomaly detection in multimedia datasets is a widely studied area. Yet, the concept drift challenge in data has been ignored or poorly handled by the majority of anomaly detection frameworks. State-of-the-art approaches assume that the data distribution at training and deployment time will be the same. However, due to various real-life environmental factors, the data may encounter drift in its distribution or may drift from one class to another in the future. Thus, a one-time trained model might not perform adequately. In this paper, we systematically investigate the effect of concept drift on various detection models and propose a modified Adaptive Gaussian Mixture Model (AGMM) based framework for anomaly detection in multimedia data. In contrast to the baseline AGMM, the proposed extension of AGMM remembers the past for a longer period in order to handle the drift better. Extensive experimental analysis shows that the proposed model handles the drift in data better than the baseline AGMM. Further, to facilitate research and comparison with the proposed framework, we contribute three multimedia datasets consisting of face images as samples. The face samples of each individual span an age difference of more than ten years to incorporate a longer temporal context.
Concept drift challenge in multimedia anomaly detection: A case study with facial datasets. Signal Processing: Image Communication, vol. 123, Article 117100.
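As a hedged illustration of the baseline idea, the sketch below performs one online update of a 1-D adaptive Gaussian mixture in the Stauffer-Grimson style, where the learning rate alpha controls how quickly old evidence is forgotten; the paper's extension, which remembers the past for longer, is not reproduced here and the thresholds are assumptions.

```python
import numpy as np

def agmm_update(x, weights, means, variances, alpha=0.01, match_thresh=2.5):
    """One online update of a 1-D adaptive Gaussian mixture.
    alpha controls forgetting: smaller alpha remembers the past longer."""
    dists = np.abs(x - means) / np.sqrt(variances)
    matched = int(np.argmin(dists)) if dists.min() < match_thresh else None
    ownership = np.zeros_like(weights)
    if matched is None:
        # no component explains x: replace the weakest one with a new, wide Gaussian
        weakest = int(np.argmin(weights))
        means[weakest], variances[weakest] = x, 10.0
    else:
        ownership[matched] = 1.0
        rho = alpha  # simplified second learning rate
        means[matched] += rho * (x - means[matched])
        variances[matched] += rho * ((x - means[matched]) ** 2 - variances[matched])
    weights = (1 - alpha) * weights + alpha * ownership
    return weights / weights.sum(), means, variances

# usage: a new observation near the second component updates that component
w, m, v = agmm_update(5.2, np.array([0.5, 0.5]), np.array([0.0, 5.0]), np.array([1.0, 1.0]))
```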